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ABSTRACT 

this  paper s*Jd  dismss  some  reseirch  issues  related  to 
the  general  topic  of  optimising  a  stochastic  system  via  simu¬ 
lation.  In  particular,  we  devote  extensile  attention  to  finite- 
difference  estimators  of  objective  function  gradients  and  pre¬ 
sent  a  number  of  new  limit  theorems, 
family  of  orthogonal  function  appro: 
behavior  of  the  objective  function.  We  show  that  if  the  ob¬ 
jective  function  is  sufficiently  smooth,  the  convergence  rate 
can  be  made  arbitrarily  close  to  n“*'a  in  the  number  of  ob¬ 
servations  required.  The  paper  concludes  with  a  brief  discus¬ 
sion  of  how  these  idea*  can  be  integrated  into  an  optimisation 
algorithm.  ^  | 


’e  alto  discuss  a  new 
itions  to  the  global 


-P-  )  - 


1.  INTRODUCTION 

In  recent  years,  considerable  attention  has  been  de¬ 
voted,  in  the  simulation  literature,  to  the  development  of  al¬ 
gorithms  for  optimizing  complex  stochastic  systems.  In  this 
paper,  we  shall  focus  on  describing  some  of  the  basic  issues 
that  arise  in  the  study  of  numerical  optimization  routines  for 
finite-dimensional  continuous  parameter  optimization  prob¬ 
lems. 

To  precisely  describe  the  class  of  problems  that  we  shall 
consider,  let  0CA  be  the  decision  parameter  over  which  the 
optimization  is  to  occur;  the  set  A  C  IR*  is  the  admissible 
set  of  decision  parameters.  For  each  0CA,  let  (Q,  T ,  Pi)  be 
the  associated  probability  space.  The  probability  measure 
P$  describes  how  the  random  environment  is  affected  by  the 
choice  of  0.  For  each  OcA,  let  X (0)  be  a  real- valued  random 
variable  corresponding  to  the  ‘‘cost”  of  running  the  system 
under  0.  Then 


(1.1) 


0(0)  i  / 

J  n 


X(0,u>)P(du) 


is  the  expected  cost  of  running  the  system  under  parame¬ 
ter  0.  Assuming  A  is  some  open  subset  of  Ft* ,  the  general 
finite-dimensional  continuous  parameter  stochastic  optimiza¬ 
tion  problem  involves  finding  0‘  CA  to  minimize  a($),  sub¬ 
ject  (possibly)  to  constraints  of  the  form 


&(*)=  I  Yi(0,u)P0(du)>  0, 
J  n 


1  <  I  <  m,  where  {Yi(0)  :  1  <  *  <  m}  is  a  collection  of 
“random  constraints." 

In  most  practical  applications,  the  objective  function 
o(0)  and  the  constraints  0i(0)  are  “smooth”  functions  of 
the  decision  parameter  0  (even  though,  typically,  X(0,u>) 
and  the  Y((0,Ui)'t  are  not  globally  smooth  in  0,  for  fixed 
ui).  Given  that  the  functions  o(0)  and  /3j(0)(l  <  t  <  m) 
can  be  cheaply  evaluated  without  error,  deterministic  mathe¬ 
matical  programming  techniques  may  be  applied  to  the  above 
optimization  problem.  Such  methods  typically  take  advan¬ 
tage  of  derivative  information  of  some  kind  (often  evaluated 
through  numerically  stable  finite-difference  approximations) 
Of  course,  in  the  context  of  a  complex  stochastic  system, 
the  objective  function  o(0)  and  the  constraints  0i(O)  will 
typically  be  evaluated  via  Monte  Carlo  simulation.  As  a  con¬ 
sequence,  there  will  be  random  error  associated  with  the  cor¬ 
responding  function  evaluations.  In  spite  of  the  presence  of 
such  error,  it  is  to  be  expected  that  derivative  information 
will  continue  to  play  an  important  role  in  the  development 
of  successful  optimization  algorithms  based  on  simulation. 
A  significant  portion  of  this  paper  is  therefore  devoted  to 
a  discussion  of  the  various  approaches  that  may  be  used  to 
calculate  derivatives  (or,  more  generally,  gradients)  via  sim¬ 
ulation. 

Section  2  is  devoted  to  a  discussion  of  the  convergence 
characteristics  of  finite-difference  estimators;  much  of  this 
material  appears  here  for  the  first  time.  In  Section  3,  we  de¬ 
scribe  a  class  of  unbiased  gradient  estimators  that  are  based 
on  likelihood  ratio  ideas.  Section  4  focuses  on  a  class  of  gra¬ 
dient  estimation  techniques  for  discrete-event  systems  known 
as  perturbation  analysis  methods.  The  estimators  of  Sections 
3  and  4  both  typically  attain  a  somewhat  faster  convergence 
rate  than  that  available  through  the  finite-difference  methods 
of  Section  2.  The  discussion  of  Sections  2  through  4  empha¬ 
sizes  the  scalar  setting  in  which  d  —  1;  Section  5  is  therefore 
devoted  to  describing  the  extension  of  these  ideas  to  the  case 
in  which  the  decision  parameter  0  is  vector- valued. 
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In  Section  6,  we  discuss  some  new  results  rdited  to 
global  approximation  at  the  objective  function  (and/or  con¬ 
straints)  by  orthogonal  functions  (specifically,  trigonometric 
polynomials).  One  way  to  apply  such  “surface  fitting”  tech¬ 
niques  |g  ti)  optimise  the  fitted  surface  (using  deterministic 
methods)  and  to  use  the  resulting  optimiser  as  an  approxi¬ 
mation  to  the  optimiser  of  the  true  surface. 

Section  7  is  devoted  to  a  discussion  of  how  the  results  of 
Section  2  through  6  can  be  used  in  an  optimisation  setting. 
Specifically,  we  discuss  some  of  the  convergence  theory  for 
the  Robbins- Monro  and  Kiefer- Wolfowits  algorithms. 

2.  FINITE- DIFFERENCE  ESTIMATORS 

In  this  section,  we  describe  some  of  the  finite-difference 
approximations  that  can  be  used  to  numerically  calculate  the 
derivative  of  a  function  <*(0)  of  the  foim  (1.1),  when  0  is 
scalar  (i.e.,  Oc]R).  In  Section  5,  we  discuss  the  special  con¬ 
siderations  that  arise  in  dealing  with  gradients  of  functions 
in  which  the  parameter  0  is  vector-valued  (i.e.,  0e52^). 


ii)  0  <  vu,B  X($o)  4  E,t(X(80)  -  a(0o))2  <  oo 

(£»(•)  denotes  the  expectation  operator  corresponding 

to  P»), 

iii)  <T2(0)  =  v&r#  X(0)  is  continuous  in  an  open  neigh¬ 
borhood  of  0o, 

iv)  0(0)  —  EfX (6)  is  infinitely  differentiable  in  an  open 
neighborhood  of  0o- 

The  following  theorem  states  that  if  the  difference  in¬ 
crement  is  chosen  optimally,  then  the  convergence  rate  of 
Aai(n,  ftB)  to  a'(0o)  is  n*1/4. 

(2.2)  THEOREM.  Assume  (2.1).  If  a"(0o)  0,  then: 

a)  if  nl^h„  — ►  oo  with  ft*  — ♦  0,  n1/4|Aoti(n,  hn)  - 
a'(0o)|  =>  OO  as  n  — ►  OO  (we  say  that  Zn  ^  oo  as 
n  — »  OO  if,  for  every  K  >  0,  P{Z„  >  K }  — *  1  as 
n  — *  oo), 

b)  if  n'/'hn  0,  n1/4|Aai(n,/»„)  -  a'(0o)|  =>  oo 
as  n  — *  oo, 

c)  if  nll*hn  — ►  ft  >  0,  then  nl/4(Acn(n,  ftn)  - 

o/(0o))  =*  £#*lN(0,l)  -  fto"(0o)/2  ..  n  - 

oo. 


2.1  Forward-Difference  Estimators 

Suppose  that  we  wish  to  estimate  o'(0o).  The  idea  here 
is  to  estimate,  via  simulation,  the  function  values  o{9q  + 
ft)  and  Or(0o)  and  to  form  a  corresponding  finite-difference 
approximation  to  a'(0o).  More  specifically,  let  X\ (0O  + 
ft).-^2(^0+ft),  ••  bei.i.d.  replicates  of  the  r.v.  X(0o+ft), 
simulated  under  common  distribution  P$a+h.  Similarly,  we 
letXi(0o),X2(0o),...be  an  independent  stream  of  i.i.d. 
replicates  of  the  r.v.  X(0q),  generated  under  common  dis¬ 
tribution  Pt0 .  Consider  the  forward-difference  estimator 


-Xt(60) 


The  determination  of  the  best  possible  difference  incre¬ 
ment  ft  introduces  a  trade-off  between  the  variance  of  the 
estimator  and  its  bias.  If  ft  is  chosen  too  small  (relative  to 
n),  the  variance  contribution  to  the  mean  square  error  will 
dominate,  whereas  if  ft  is  chosen  too  large  (relative  to  n), 
the  bias  will  govern  the  convergence  rate.  It  turns  out  that 
the  optimal  difference  increment  ft  =  ftn,  in  this  setting,  is 
typically  of  order  n-1^4.  To  rigorously  state  the  result,  we 
assume  that: 

(2.1) 

i)  P#{*(0)f-}  =*  /%,{*(« o)f  }  -  0  -  00  (=>  de¬ 
notes  convergence  in  distribution), 


A  a1(n,h)=1-j2(- 


(*o  + ft) 


The  proof  of  this  result  appears  in  the  Appendix.  (A 
similar  theorem,  under  different  hypotheses,  appears  in  FOX 
and  GLYNN  (1969).)  We  note  that  the  value  of  ft  which 
minimises  the  second  moment  of  the  limiting  r.v.  appearing 
in  c)  is 


Thus,  the  difference  increment  that  minimizes  asymptotic 
mean  square  error  is  ftn  =  ft*n-1^4.  This  result  was  ob¬ 
tained  previously  by  ZAZANIS  and  SURI  (1966).  It  it  worth 
observing  that  if  one  wishes  to  minimize  the  mean  absolute 
error  of  the  estimator,  then  the  optimal  difference  increment 
takes  the  form  ft„  =  ft.n-1^4,  where  (typically)  ft.  ^  ft*. 
(To  see  this,  observe  that  ft.  would  be  obtained  by  minimiz¬ 
ing  the  first  absolute  moment  of  the  limiting  normal  r.v.  ap¬ 
pearing  in  c).)  Stated  more  abstractly,  the  L2  and  L1  error 
criteria  do  not  yield  precisely  the  same  sequence  of  optimal 
difference  increments. 

2.2  Central-Difference  Estimators 

Theorem  2.2  states  that  the  forward-difference  estima¬ 
tor  converges  (at  best)  at  rate  n-1^4  to  the  derivative 
cr'(0o).  One  way  to  improve  upon  this  poor  convergence 
rate  is  to  instead  use  a  central-difference  approximation  to 
the  derivative.  When  function  evaluations  are  made  with¬ 
out  error,  this  is  known  to  be  a  numerically  more  accurate 
approximation  to  the  derivative. 
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To  precisely  define  the  estimator,  we  let  Xi(8q  +  A), 
Xj(0o  +  A), ...  be  ii.d.  replicates  of  the  r.v.  X (8q  +  A), 
simulated  under  common  distribution  PtB+h-  Similariy,  we 
let  Xi{0o  —  A),  Xi($o  ~  A), ...  be  an  independent  stream 
of  ia.d.  replicates  of  the  r.v.  X(8q  —  A ),  generated  under 
Pt0-h-  The  central- difference  estimator  is  defined  as 


-Xi(8a-hY 


a»0.*)-lg(*g*a5* 


The  following  theorem  summarizes  the  behavior  of 
Aaj(n,  A);  the  proof  is  similar  to  that  of  Theorem  2.2  and 
is  omitted. 

(2.3)  THEOREM.  Assume  (2.1).  If  Qr^(#o)  ^  0,  then: 

.)  i fn'/fAn  — »  oo  with  An  — ►  0,  n1/3|AQj(n,/i„)  — 
a'(^o)|  =>  oo  as  n  — *  oo, 

b)  if  n1/6A„  — ♦  0,  n1/3|a2(n,/»n)  -  a'(0o)|  =>•  oo  as 
n  —  oo, 

c)  if  n1/6A„  -*  A  >  0,  then  r»1/3(Aor2(n,  An)  - 

<*'(0o))  1)  -  ^a<3>(*0)  «•  n 


The  improved  convergence  rate  (of  order  n~1^3,  as  op¬ 
posed  to  n~ll*  for  forward  differences)  is  obtained  here  be¬ 
cause  of  the  fact  that  central  differences  are  less  biased  than 
forward  differences.  This  permits  the  difference  increment  to 
be  chosen  larger  (of  order  n-J/6,  as  opposed  to  n-1^  for 
forward  differences)  which,  in  turn,  reduces  the  variability  of 
the  estimator. 


The  choice  of  A  in  c)  that  minimizes  the  asymptotic 
mean  square  error  of  the  central  difference  estimator  is  (see 
also  ZAZANIS  and  SURI  (1966)) 


9<T2(6>q)  ' 

“  W3>(*o)J. 


2.3  Finite-Difference  Estimators  Using  Common 

Random  Numbers 

The  central-difference  estimator  improves  upon  the  con¬ 
vergence  rate  of  the  forward-difference  estimator  by  reducing 
its  bias  (for  fixed  A).  The  method  that  we  shall  describe  here 
improves  upon  the  convergence  rate  of  the  forward-difference 
estimator  by  reducing  its  variability  (for  fixed  A).  The  idea 
is  to  generate  the  replicates  of  X  (8q  +  A)  using  the  tame 
stream  of  random  numbers  that  were  used  to  obtain  the  repli¬ 
cates  of  X(6q ).  This,  of  course,  is  nothing  more  than  the 


method  of  common  random  numbers,  as  applied  to  doivative 
estimation. 

Suppose  that  the  r.v.  Y (do  +  A)  is  produced  from  the 
same  stream  of  random  numbers  at  is  X(8q)  and  shares 
the  same  distribution  as  does  X(0o  +  A)  under  PfB+h- 
By  convention,  we  set  y(0o)  =  X(0o)-  Let  Ay(A)  = 
Y (0o  +  A)  —  Y (0O).  We  make  the  following  assumptions 
about  our  common  random  number  scheme: 


i)  P{Y(h)e  )  =  P#e+*{X(0o  +  A)c). 

ii)  EAY(h)*  =  A«Tj  +  ©(A)  ss  A  1  0,  where  ©j  >  0, 

iii)  there  exists  £  >  0  such  that  E|Ay(A)|2+t  =  h0i  + 
o(A)  as  A  J.  0, 

iv)  a(8)  =  EgX(0)  is  two  times  continuously  differen¬ 
tiable  in  an  open  neighborhood  of  00- 

Lei  Ayl(A),Ay2(A)1...  be  an  i.i.d.  sequence  of 
replicates  of  the  r.v.  Ay ( A ) .  The  forward-difference  com¬ 
mon  random  numbers  estimator  for  a'(0 o)  is  then  given  by 


a  /  l\  *  v"  AYi(A) 
Ao3(n,A)=-^— . 


Before  proceeding  to  a  statement  of  the  convergence  rate  the¬ 
orem  for  Aa3(n,  A),  we  pause  to  discuss  our  assumptions 
further.  Consider  the  typical  discrete-event  system.  Let 
j4(A)  be  the  event  that  Y (do  +  A)  experiences  a  change 
in  the  order  of  events  from  that  experienced  by  Y (do)-  On 
the  event  A(h),  Y (8o  +  A)  —  y(#o)  •*  typically  of  unit 
magnitude.  On  the  other  hand,  on  the  complement  of  -4(A), 
Y(0O  +  A)  —  y  (f?o)  i»  typically  of  order  A  in  magnitude. 
Also,  for  most  discrete-event  systems  P(A(h))  =  \h+o(h) 
for  some  X  >  0.  For  p  >  0,  write 


EAY(h)p  =  EAY(hyi(A(h))+ EAY(h)pI(A(h)c). 


This  decomposition  suggests  that  for  most  discrete-event  sys¬ 
tems,  EAY(hy  —  0ph  +  o(A)  for  P  >  1.  This  explains 
the  form  of  (2.4)  ii)  and  iii). 


(2.5)  THEOREM.  Assume  (2.4).  If  o"(0O)  ^  0  and  _ _ ^ 

nA„  — *  OO,  then: 

a)  if  n^3An  -»  oo  with  A„  — *  0,  n1/3|Aa3(n,  A„)  -  □ 

a'(^o)l  ^  oo  as  n  — >  oo,  □ 

b)  if  n’^A,,  — »  0,  n1/3|AQ3(n,  A„)  -  o'(0o)|  =>  oo  !l - 

as  n  —  oo, 

c)  ifnI/3Afl  — ►  A  >  0,  nI^3(Ao3(n,  An)-o'(0o))  => _ 

^,V(0,l)-f»"(*0)».n-.oo.  _  . .r/ 

I  A v'.  .’  VI*'./  C  o’Vps 


iDibi.  i 


We  note  that  the  convergence  rate  here  i«  of  order 
n~lf3,  the  aame  a*  that  obtained  earlier  for  the  central- 
difference  eetimator.  Observe  that  the  optimal  difference  in¬ 
crement  now  is  of  order  n-1^3,  which  is  much  smaller  than 
the  difference  increment  of  order  n-1/6  derived  for  central 
differences.  The  lower  variability  of  the  common  random 
numbers  estimator  is  what  permits  us  to  choose  the  smaller 
increment.  We  further  note  that  the  value  of  h  appearing  in 
c)  which  minimises  the  asymptotic  mean  square  error  of  the 
estimator  A<»3 (n,hn)  is 


A* 


Of  course,  we  can  also  combine  central  differences  and 
common  random  numbers.  Let  AcY(h)  —  Y (6q  +  h)  — 
Y{6 o  —  h).  The  following  assumption  is  the  analogue  of 
(2<); 


i)  P{Y(h)£  }  =  Pla+h{X(0o  +  h)e ■}, 

ii)  EAcY(h)7  =  hff 2  +  o(h)  as  h  J.  0,  where  tr2  >  0, 

iii)  there  exists  £  >  0  such  that  £,|AYe(A)|2+t  =  A/3 2  + 
o(A)  as  A  j  0, 

iv)  a(0)  —  EtX(9)  is  three  times  continuously  differen¬ 
tiable  in  an  open  neighborhood  of  6 0 . 


Let  AeYi(/>),  AeY2(A),  ...  be  an  i.i.d.  sequence  of 
replicates  of  the  r.v.  ACY(A).  The  central-difference  com¬ 
mon  random  numbers  estimator  of  a' (6$)  is  given  by 


Aa<(r»,h)  =  ^  ^ 

1=1 


A'Yj(h) 
2  A 


A * 


(  V/5 


2.4  Finite-Difference  Estimators  With  A  Near- 
Optimal  Convergence  Hate 

In  the  preceding  three  sections,  we  have  discussed  four 
different  finite-difference  estimators.  The  convergence  rate 
was  improved  from  order  n-1^4,  in  the  case  of  forward  differ¬ 
ences  with  independent  streams  of  random  numbers,  to  order 
n~2^S,  which  was  achieved  by  a  central  difference  estimator 
that  used  a  common  stream  of  random  numbers.  A  natural 
question  that  arises  here  is  whether  any  further  improvement 
is  possible.  In  particular,  can  one  obtain  finite-difference  esti¬ 
mators  for  the  derivative  that  achieve  a  convergence  rate  that 
is  arbitrarily  dose  to  the  best  possible  rate  for  a  Monte  Carlo 
procedure,  namely  n-1/2?  We  will  now  answer  this  question 
in  the  affirmative  by  developing  such  a  class  of  estimators. 

To  produce  the  type  of  estimator  that  we  have  in  mind, 
we  need  to  obtain  a  finite-difference  approximation  to  q'(6q) 
that  is  as  unbiased  as  possible.  Suppose,  for  the  moment, 
that  a  is  an  analytic  function  in  6.  Then 


(2.8)  *(*  +  A)=£>(b)W~. 

_  n . 

n=0 

Let  TfcCt  be  the  “shifted”  function  defined  by  (T/,a)(0)  = 
a (6  +  A).  We  further  let  Da  be  the  derivative  function 
specified  by  (Da){6)  —  The  expansion  (2.8)  may 

then  be  written  as 


The  proof  of  the  following  convergence  rate  theorem  for 
A 04(11,  h)  follows  the  same  lines  as  that  for  Ack3(n,A); 
the  proof  is  therefore  omitted. 

(2.7)  THEOREM.  Assume  (2.6).  If  a(3){0o)  /  0  and 
nh„  — >  OO,  then: 

a)  if  n1/5A„  — *  oo  with  A„  —  0,  n2/5|Ao4(n,/»n)  - 
a'(0o)|  =>  00  as  n  — ►  00, 

b)  if  nll*hn  — »  0,  n2/5|Aa4(n,  A„)  -  o'(0o)|  =>  oo 
as  n  — •  00, 

c)  ifn^5An  —  A  >  0,  n2/s(Ac»4(n,/tn)-a'(0o))  => 

^N(0,l)-^a(3^o)-n^oo. 


(2.9)  Th a  =  £  ^D"o. 

n! 

Proceeding  formally,  we  may  rewrite  (2.9)  in  terms  of  the 
operators  7\  and  D  as 

(2.10)  Th  =  £  Dn  =  exp (/»£>). 

n= 0 


Thus,  combining  common  random  numbers  and  central 
differences  improves  the  convergence  rate  of  the  derivative 
estimator  to  order  n-2^5.  Furthermore,  the  difference  in¬ 
crement  that  minimites  the  asymptotic  mean  square  error  of 
the  estimator  Q4 (n,  A„)  is  A„  =  A*n-1/S,  where 


We  now  wish  to  express  the  operator  D  in  terms  of  the  shift 
operator  7*  ■ 

D  =  l-\og(Th). 


A 


Expanding  the  logarithm  in  a  formal  power  series,  we  obtain 


<a.») 


(as(h„,n)  -  a'(60))  =>  ^3^(0, 1) 

aa  n  — ►  00,  where  (f\  =  <7J(0o )Xm/h't  “><1 


To  obtain  a  finite-difference  approximation  to  a'(0O)  = 
(Da)(tfo).  we  truncate  the  aeriea  (2.11)  at  the  m’th  term: 


*=1 t=0  '  7 


Noting  that  (Tfc)*  =  T**,  we  obtain  the  following  approxi¬ 
mation  to  q'{9q): 


(MX  •'(«.)  *  J  EE  (!)  +  ">• 


To  obtain  a  finite-difference  eatiroator  for  a'(tfo).  we  let 
Xi(6q  +  ht),X3(6o  +  hi), ...  be  ij.d.  replicate*  of  the 
r.v.  X(6q  +  hi),  simulated  under  common  distribution 
P$o+ht(0  <  t  <  m).  We  further  generate  each  of  the 
m  -f  1  sequence*  independently  of  one  another  (i.e.,  the  se¬ 
quences  (Xi(6o  +  hi)  :  t  >  1)  are  mutually  independent 
for  0  <  t  <  m).  Set 


*<<*>  =  sEEO)  (-=^Xdh  +  ht). 

*=1 1=0  '  7 


-(Eiy+t(t({)i)’- 


According  to  Theorem  2.13,  the  convergence  rate  of 
at(hn,n)  may  be  made  a*  dose  as  we  wish  to  n~1/2,  hr 
choosing  m  sufficiently  Urge.  To  some  extent,  this  conver¬ 
gence  rate  is  deceptive.  Note,  in  particular,  that  the  constant 
Xm  is  increasing  in  171.  Furthermore,  the  construction  of  each 
observation  Zi(h)  that  enters  O5 (/»,  n)  requires  m  indepen¬ 
dent  simulations.  As  a  consequence,  the  computational  effort 
required  to  generate  05 (hn,Tl)  is  sensitive  to  the  choice  of 
m.  Thus,  although  the  convergence  rate  promised  by  The¬ 
orem  2.13  is  significantly  better  than  those  described  earlier 
in  this  section,  the  run-lengths  required  to  see  such  an  im¬ 
provement  may  be  quite  Urge. 


3.  LIKELIHOOD  RATIO  DERIVATIVE 
ESTIMATORS 

In  certain  settings,  it  is  possible  to  construct  derivative 
estimators  that  achieve  the  best  possible  rate  of  convergence 
for  a  Monte  Carlo  estimator,  namely  n-1^2  in  the  number 
of  observations  n  that  are  generated. 

Suppose,  for  the  moment,  that  the  distribution  defining 
a(0)  is  independent  of  0.  Then,  all  the  5-dependence  of  O 
sits  in  the  r.v.  X{0),  so  that  o(0)  —  EX(0).  Assum¬ 
ing  that  we  can  interchange  the  derivative  operator  and  the 
expectation,  we  get 


The  expectation  of  Zi(h)  then  matches  the  right-hand  side  of 

(2.12) .  We  then  obtain  a  finite-difference  derivative  estimator 
by  setting 

Our  next  theorem  describe*  the  convergence  rate  of 
0t$(hn ,  n),  when  the  difference  increment  it  chosen  appro¬ 
priately. 

(2.13)  THEOREM.  Assume  (2.1).  Ifm  >  1  and/l„n1/Jm 
— »  h  >  0  as  n  — *  00,  then 


a'(0o)  =  EW(0 o), 

where  W(50)  =  X'(0a).  Then,  by  generating  i.i.d.  repli¬ 
cates  of  W {So),  we  obtain  an  estimator  which  (use  the  stan¬ 
dard  central  limit  theorem)  possesses  the  canonical  conver¬ 
gence  rate  n-1^2.  The  idea  behind  the  likelihood  ratio 
method  (and  the  perturbation  analysis  approach  of  the  next 
section)  is  to  structure  the  representation  of  a  so  that  the 
driving  distribution  is  rendered  independent  of  9- 

Suppose  that  the  distribution  Pi  defining  Q  has  density 
L(9)  with  respect  to  some  common  distribution  P ,  so  that 
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(3.1)  Pt(du)  =  L(6,u)P(<kj). 

The  r.v.  L(0)  is  called  the  likelihood  ratio  of  Pg  (with 
retpect  to  P).  Under  this  assumption, 


<*(*)  =  /  X(0,U)Pt(du) 

J  n 

=  /  X(0,u)L(0,u)P(du) 
Jn 

=  f  Y(0,u)P(du) 

Jn 

=  EY(0), 


where  Y {O')  —  X(0)L(0).  This  is  the  desired  represen¬ 
tation  of  a.  Assuming  that  the  derivative-expectation  in¬ 
terchange  is  valid  (and  it  typically  is),  we  obtain  a'(0 o)  = 
EW(6o),  where  W{00)  =  X,(0o)L(0q)  + X($o)L'(0o). 
Hence,  the  key  to  obtaining  likelihood  ratio  derivative  esti¬ 
mators  is  finding  a  distribution  P  and  a  r.v.  L(6)  such  that 
(3.1)  holds  (at  least  for  0  in  an  open  neighborhood  of  6q). 


This  idea  is  easily  illustrated  when  the  basic  sample 
space  0  is  the  real  line.  Suppose  that  the  distribution  Pi 
takes  the  form  Pi(dx)  =  f(6,  z)fi(dx).  For  example,  if 
H(dx)  —  dx,  we  are  saying  that  P#  has  a  (Lebesgue)  density 
for  each  0.  Choose  g(x)  >  0  so  that 


/  9(*)»{dx)  =  1. 

Jr 

(This  can  always  be  done  if  fi  is  (T-finite.)  Set  P(dx)  — 
g(x)fi(dx)  and  observe  that  (3.1)  holds  with  L(0,x)  = 
/(#.  *)/$(*)• 

Suppose  that  we  are  interested  in  estimating  both 
q{0q)  and  o'(0o)-  Assume  that  the  set  A(0)  =  {x  : 
f(0,x)  >  0}  is  independent  of  $  in  a  neighborhood  of 
Bo-  A  particularly  convenient  choice  of  g(x),  in  this  case, 
is  g(x)  =  f(0 o,x).  Because  A(0)  is  independent  of  6,  the 
likelihood  ratio  L(0,z)  =  f(0,x)/f(0o,x)  is  well-defined 
in  a  neighborhood  of  0.  This  choice  of  g  has  several  advan¬ 
tages.  Note  that  to  generate  W(0q),  the  simulation  involves 
generating  outcomes  under  the  distribution  Pg0.  The  dis¬ 
tribution  P*0  is  typically  the  “natural"  distribution  for  esti¬ 
mating  flr(#o)'  Hence,  q(9q)  and  o'(6q)  can  easily  be  esti¬ 
mated  from  the  same  set  of  sampling  experiments.  A  second 
advantage  is  that  by  choosing  g  in  this  way,  L(8q)  =  1  and 
the  formula  for  W(9r>)  simplifies.  In  fact,  in  most  applica¬ 
tions,  the  calculation  of  L'(Bq)  also  simplifies  considerably, 
when  g  is  chosen  so  that  g(x)  —  f(8a,  x).  Furthermore,  in 


many  sampling  settings,  this  choice  of  g  leads  to  an  estimator 
W (0o )  that  has  desirable  variance  properties. 

As  the  above  discussion  suggests,  an  important  issue 
in  the  development  of  likelihood  ratio  gradient  estimators  is 
the  construction  of  a  likelihood  ratio  (for  a  given  class  of 
discrete-event  simulations)  that  has  desirable  computational 
and  variability  characteristics.  For  example,  it  turns  out  that 
in  a  discrete-event  simulation  context,  the  likelihood  ratio 
typically  exists  only  for  terminating  simulation  problems.  Of 
course,  steady-state  characteristics  can  be  analysed  as  a  limit 
of  finite- horison  estimation  problems.  Unfortunately,  the  as¬ 
sociated  likelihood  radios  become  successively  more  unstable 
as  the  time  horison  gets  targe.  However,  this  problem  can 
be  avoided  if  the  discrete-event  system  has  the  right  kind  of 
structure  (typically,  regenerative  structure).  REIMAN  and 
WEISS  (1986)  discuss  some  of  the  relevant  ideas. 

4.  PERTURBATION  ANALYSIS  DERIVATIVE 
ESTIMATORS 

In  Section  3,  we  described  the  likelihood  ratio  approach 
to  derivative  estimation.  The  basic  idea  was  to  use  the 
method  of  likelihood  ratios  so  as  to  obtain  a  representation 
of  a  in  which  the  driving  distribution  is  independent  of  6. 
In  this  section,  we  describe  an  alternative  technique  for  ob¬ 
taining  such  a  representation.  The  idea  is  to  return  to  the 
Common  random  numbers  technique  described  in  Section 
3.3.  Suppose  that  we  can  find  a  probability  space  (fi,  J~  ,  V) 
and  a  family  of  r.v.’s  { Y (/l)  :  |/l|  <  e}  such  that: 


(4.1)  P{V(/»)E}  =  P,e+h{X(0o  +  h)e }. 


Under  this  assumption,  it  follows  that 


0(00  +  />)  =  EY{h) 


for  |  A  |  <  f .  Assuming  that  we  can  interchange  the  derivative 
and  expectation  operators,  we  find  that  O^o)  =  EW (0o). 
where  lV(0o)  =  J"(0).  Hence,  by  generating  i.i.d.  repli¬ 
cates  of  the  r.v.  [V(0 o),  we  obtain  an  estimator  that  achieves 
the  n-1/3  convergence  rate  that  is  best  possible  for  a  Monte 
Carlo  procedure. 

As  in  the  case  of  the  likelihood  ratio  method,  this  tech¬ 
nique  is  best  illustrated  when  the  basic  sample  space  is  the 
unit  interval.  Let  0  =  [0, 1],  T  =Borel  sets  of  [0, 1],  and 
let  P  be  uniform  distribution  on  fl.  Set  U (u>)  —  U)  and 
observe  that 
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Y(h)  =  F^(U) 

wtisfioi  (4.X)  (where  Fh(x)  =  Pt0+h{X(9o  +  h)  <  z}). 
To  calculate  W(0 o),  we  need  to  determine 
Assume,  for  the  moment,  that  Fh  has  a  (Lebesgue)  density 
fh  for  each  h.  By  definition  of  the  inverse  distribution  func¬ 
tion  Fh ,  we  have 

Fh(F^(x))  =  z. 

Differentiating  both  aides  of  the  above  expression  with  re¬ 
spect  to  h,  we  get 

J-nt/r1**))  +  Mf;Hx))-^f;\x)  =  0, 

from  which  we  obtain 


=  -^Fh(F-(z))/A(F-(F-(z)). 

Thus,  in  this  setting,  we  find  that 


common  random  numbers  technique  can  still  be  applied,  in 
spite  of  the  non-existence  of  a  density.  For  example,  suppose 
that  X($)  —  y(0,  Y)  and  Y  has  distribution  F(y/6)  un¬ 
der  P$.  It  we  set  Y(h)  —  (0o  +  A)Y/0 o  (00  >  0),  then 
(4.1)  is  satisfied  with  P  =  P$0,  and  we  find  that 


a(0o  4-  h)  =  E$og(60  4-  A,(0O  4-  h)Y/60). 

If  y  is  smooth,  it  is  dear  that  the  derivative  of  g  (0o +h,  (0o4- 
h)Y/0o)  exists,  regardless  of  the  nature  of  the  distribution 

F. 

The  representation  (4.2)  (for  VY(0o))  can  be  derived 
via  an  altemsaive  argument.  Recall  that  o(0)  =  EfX(d). 
As  a  consequence,  it  X(0)  is  non-negative,  we  find  that 

a(0o  4-  h)  =  Eto+H  r  I{X{9 o  +  ft)  >  x)dx 
Jo 

(l  —  Fh(x))dx. 

Hence,  assuming  that  the  derivative  and  integral  operators 
can  be  interchanged,  we  obtain 


WWo)  =  -jfiFk(F;l(U))/h(Ftl(U))\hs  o. 


(4.4)  o'(0O  )~-J0  ^Fk(x)dx\h- 0 


One  undesirable  feature  of  the  approach  that  we  have  just 
outlined  is  that  since  we  have  taken  our  basic  probability 
apace  as  uniform  distribution  on  [0,  1],  the  generation  of 
!V(0o)  appears  to  require  inversion  (i.e.,  calculation  of 
Fh  1  (•)).  Recall,  however,  that  Fjf  1  ( (/  )  has  the  same  distri¬ 
bution  as  Y(h)  (or,  alternatively,  X(6q  4  /l)  under  Pf0+h)- 
Hence,  W(0q)  (under  P)  shares  the  same  distribution  as 


(4.2)  -^Fk(X(0  o))//k(X(0o))U=o 


In  order  to  apply  the  Monte  Carlo  method  to  the  numerical 
evaluation  of  the  integral  appearing  in  (4.4),  we  need  to  rep¬ 
resent  it  as  an  expectation.  One  way  to  do  this  is  as  follows 
(assuming  Fh  has  a  density  fh): 


(4.5  ) 

o' (So) = -  r 

Jo 


&/*(*) 

/*(*) 


=  -^a|;F5(X(0o))/A(X(0o))|h=o 


(under  P$0).  The  advantage  of  the  representation  (4.2)  is 
that  we  can  generate  the  derivative  observations  using  pre¬ 
cisely  the  same  algorithm  as  that  used  to  estimate  a(0o) 
itself  (since  a(0o)  is  typically  estimated  by  generating  i.i.d. 
replicates  of  X(0q)  under  P#0 ).  See  GLYNN  (1967)  for  ad¬ 
ditional  details. 


which  is  just  (4.2).  It  is  interesting  to  note  that  an  alternative 
representation  of  the  expectation  Ct(0o  4-  h)  exists: 


(4.6)  a(0O  +  h)=  f  xFh(dx). 

Jo 


The  argument  that  led  to  (4.2)  appears  to  require  exis¬ 
tence  of  a  density  fh-  It  turns  out  that  in  many  settings,  the 


7 


If  Fh  has  a  density,  this  becomes 


a(0o  +  h)  =  /  zfh(z)dx. 

Jo 

Attaining  that  the  derivative-integral  interchange  is  valid,  we 


To  represent  the  integral  in  (4-7)  as  an  expectation,  we  use 
the  same  idea  as  in  (4.5): 


a'(0o)  =  EtoX(60) 


fuMWo)) 

fk(X{9o)) 


h=Q 


In  particular,  if  X(0)  —  X  (in  which  case  Ot(0)  =  EgX), 
we  obtain 


(4.8) 


o'(tfo)  =  Et  0X 


&fk(X) 

MX) 


h= o' 


It  turns  out  that  (4.8)  is  precisely  the  likelihood  ratio  deriva¬ 
tive  estimator  of  Section  3.  Hence,  in  this  simple  setting, 
the  common  random  numbers  approach  and  the  likelihood 
ratio  method  derive  from  the  two  analytical  representations 
(4.3)  and  (4.6)  for  the  mean  of  a  non-negative  r.v.  Since 
(4.3)  and  (4.6)  are  usually  obtained  from  one  another  by  an 
integration-by-parts,  it  follows  that  the  likelihood  ratio  and 
common  random  numbers  methods  are  related  through  an 
integration-by-parts  in  this  simple  context.  We  note,  paren¬ 
thetically,  that  if  X(0)  is  non-negative,  then 


It  turns  out  that  the  common  random  numbers  deriva¬ 
tive  estimation  method  described  above  can  be  applied  to  cal¬ 
culate  derivatives  of  performance  measures  for  discrete-event 
dynamical  systems.  The  subject  of  perturbation  analysis 
is  concerned  with  the  study  and  development  of  the  resulting 
estimators.  For  example,  consider  a  discrete-event  system 
in  which  the  measure  Pf  characterises  the  distribution  (over 
sample  trajectories)  when  the  event-sdieduling  distributions 
are  indexed  by  a  scale  parameter  6.  Now,  because  0  appears 
as  a  scale  parameter  in  the  event-scheduling  distribution,  we 
can  view  the  event-scheduling  r.v.’s  that  are  generated  as 
taking  the  form  0X\ ,  OXj, ...  for  r.v.’s  Xj ,  Xj,  -  •  ■  having 
distribution  independent  of  $.  For  discrete-event  systems 
in  which  the  probability  of  two  events  occurring  simultane¬ 
ously  is  aero,  a  small  perturbation  of  the  event  times  will 
have  no  effect  on  the  order  of  the  state  transitions  experi¬ 
enced  by  the  discrete-event  system.  The  effect  of  the  param¬ 
eter  0  will  reflect  itself  only  in  the  timing  of  the  sequence 
of  state  transitions.  Furthermore,  ss  HO  and  CAO  (1983) 
point  out,  the  manner  in  which  the  perturbation  propagates 
itself  through  the  sequence  of  event  timings  is  suitable  to  a 
highly  efficient  recursive  computation  (i.e.,  the  perturbation 
of  the  n’th  state  transition  epoch  is  easily  calculated  from 
that  of  the  (n  —  l)’st).  These  ideas  lead  to  an  easily  cal¬ 
culated  sample  path  derivative  for  discrete-event  systems  in 
which  the  event-scheduling  distributions  are  parameterized; 
see  SURI  (1987)  for  additional  details  on  the  nature  of  the 
infinitesimal  perturbation  analysis  (IPA)  derivative  compu¬ 
tation. 

As  described  above,  the  IPA  approach  to  derivative  esti¬ 
mation  focuses  on  derivative  estimation  relative  to  perturba¬ 
tions  in  the  event-scheduling  distributions.  In  many  queueing 
settings,  one  wishes  to  optimize  over  routing  probabilities, 
however.  Likelihood  ratio  methods  are  highly  flexible  and 
can  be  applied  in  a  straightforward  manner  to  such  prob¬ 
lems.  Recent  extensions  of  IPA  to  such  routing  probability 
derivative  estimation  problems  hold  significant  promise,  how¬ 
ever  (see  HO  and  CAO  (1985)). 


Etx{6) = P  r 

Jo 


t’-1P${X(0)>t’)dt 


for  p  >  0,  from  which  it  follows  that 


(4.9)  a'(0o)  =  -pE,oX(0 c)'-1 


j^Fk(x(e0)p)  | 

h(X(0o))  l»-o‘ 


Formula  (4.9)  generalizes  (4.5).  In  principle,  one  could  opti¬ 
mize  over  p  in  order  to  determine  that  p- value  which  yields 
a  derivative  estimator  with  the  minimum  variance. 


Empirical  evidence,  gathered  to  date,  appears  to  sug¬ 
gest  that  when  both  IPA  and  likelihood  ratio  methods  apply 
to  a  given  problem,  the  IPA  estimator  will  typically  be  more 
effidoit  (in  the  sense  of  having  lower  variability).  This  con¬ 
clusion  stems,  in  part,  from  the  fact  that  likelihood  ratio 
estimators  are  known  to  have  a  variability  that  increases  in 
a  roughly  linear  fashion  with  the  time  horizon  of  the  simula¬ 
tion;  see  REIMAN  and  WEISS  (1966). 

Some  care  must  be  taken  in  applying  IPA  techniques 
to  a  given  problem,  however.  The  difficulty  is  that  the  inter¬ 
change  of  derivative  and  expectation  operators  that  is  needed 
to  rigorously  justify  the  IPA  estimator  (see  (4.1))  may  some¬ 
times  be  invalid.  In  such  settings,  the  IPA  estimator  can 
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converge  to  the  wrong  quantity.  To  get  some  iom  of  the 
problem,  we  note  that  if  Y  ( h )  haa  a  well-behaved  derivative 
Y'(0)  at  h  —  0,  then  we  would  expect  that 


(4.10)  /»-2vw[y(A)  -  Y(0)]  -  varY'(O) 

aa  h  |  0.  (In  fact,  (4.10)  ia  a  sufficient  condition  for  permit¬ 
ting  the  interchange  of  derivative  and  expectation.)  Hence, 


(4.11)  £AY(/»)2  =  h2EY'(  0)2  +  oih2) 

aa  h  [  0.  Recall,  however,  that  in  Section  2.3,  we  argued 
that  the  typical  behavior  of  a  discrete-event  system  was  gov¬ 
erned  by  (2.4)  ii),  which  contradicts  (4.11).  The  difficulty  ia 
that  while  the  effect  of  the  perturbations  on  the  state  tran¬ 
sition  sequence  may  be  ignored  in  calculating  W'(6o),  it 
cannot  be  typically  ignored  in  calculating  q'(0q).  In  HEI- 
DELBERGER  et  al.  (1988),  this  point  ia  analysed  further. 
It  is  shown  that  conventional  IP  A  can  be  inconsistent  (in 
the  sense  of  convergence  to  an  incorrect  answer)  for  mul¬ 
tiple  customer-type  queueing  networks.  However,  conven¬ 
tional  IPA  turns  out  to  be  consistent  for  a  large  number  of 
performance  measures  associated  with  single  customer  type 
networks. 

Furthermore,  a  number  of  extensions  in  the  basic  IPA 
algorithm  hold  significant  promise  for  overcoming  the  diffi¬ 
culties  that  arise  in  the  multiple  customer  context.  In  partic¬ 
ular,  a  new  version  of  IPA,  known  as  smoothed  perturbation 
analysis  (SPA),  is  now  under  development.  The  idea  it  that, 
rather  than  work  with  the  “raw"  sample  path  Y (h)  itself, 
one  considers  instead  the  conditional  expectation  of  Y (h) 
with  respect  to  some  appropriately  chosen  conditioning  vari¬ 
able  Z  (appropriate  in  the  sense  that  E(Y(h)\Z)  is  easily 
calculated).  Since  a  conditional  expectation  involves  an  inte¬ 
gration  operation,  the  conditioning  ought  to  yield  a  process 
E(Y(h)\Z)  which  is  smoother  in  h  than  is  Y(h)  itself. 
As  a  consequence,  SPA  has  the  potential  to  deal  with  esti¬ 
mation  problems  for  whidi  classical  IPA  does  not  work;  tee 
GLASSERMAN  and  GONG  (1989)  for  further  details. 

5.  GRADIENT  ESTIMATION 

In  the  previous  three  sections,  we  have  described  deriva¬ 
tive  estimation  techniques  that  are  applicable  to  problems  in 
which  the  decision  parameter  6  is  scalar-valued.  The  meth¬ 
ods  of  Section  2  gave  rise  to  estimators  for  which  their  re¬ 
spective  convergence  rates  were  slower  than  n-1^3  in  the 
number  n  of  observations  n  that  were  generated.  On  the 


other  hand,  the  likelihood  ratio  and  perturbation  analysis 
techniques  that  were  described  in  Sections  3  and  4  attained 
the  canonical  convergence  rate  of  n-t/3. 


The  generalisation  of  these  ideas  to  the  setting  in  which 
0  is  vector- valued  is  straightforward.  The  partial  derivatives 
with  respect  to  each  of  the  co-ordinates  0i  is  easily  calculated 
in  the  same  way  as  the  scalar  derivatives  were  estimated  ear- 
ha.  However,  the  computational  complexity  of  calculating  a 
d-dimensional  gradient  is  highly  sensitive  to  d  and  is  an  issue 
which  is  specific  to  the  setting  in  which  6  is  vector-valued  (aa 
opposed  to  scalar  valued).  For  example,  note  that  a  forward 
difference  estimator  lor  the  d-dimensional  gradient  Vor(0o) 
involves  performing  simulations  at  the  d- f  1  parameter  points 
0Oi  +  hiei , . . . ,  So  +  hjtj,  where  tj  is  the  i’th  unit  vec¬ 
tor.  On  the  othe'  hand,  a  central  difference  approximation 
requires  simulating  at  the  2d  points  $o  i  hjej.  Thus,  a  cen¬ 
tral  difference  estimator  for  a  d-dimensional  gradient  requires 
roughly  twice  as  much  computational  effort  as  a  forward  dif¬ 
ference  estimator  to  obtain  the  same  number  of  observations. 
This,  however,  is  balanced  by  the  fact  that  the  convergence 
rate  of  a  central  difference  estimator  is  more  rapid  than  that 
of  a  forward  difference  estimator.  As  a  consequence,  we  see 
that  if  d  is  large,  a  forward  difference  estimator  may  be  more 
efficient  for  small  n.  If  n  is  large  enough,  however,  the  cen¬ 
tral  difference  estimator  always  wins.  This  dimensionality 
effect  becomes  even  more  pronounced  for  the  “near  optimal" 
difference  estimators  of  Section  2.4.  Note  that  to  estimate 
a  d-dimensional  gradient,  simulations  at  the  md  -f  1  points 
6q,  00  +  /ti/ei, . . .  ,0q  -f-hjlcj  (1  <  /  <  rn)  are  needed. 
Hence,  the  dimensionalilty  degradation  that  occurs  with  this 
estimator  is  even  more  serious  than  that  experienced  by  the 
central  difference  estimators  discussed  earlier.  An  additional 
disadvantage  of  this  class  of  estimators  is  that  they  can  be 
quite  sensitive  to  numerical  round-off  error  when  m  is  large. 
(The  presence  of  the  alternating  sign  (  —  1  )1_*  can  lead  to 
numerical  instability.) 

Turning  now  to  the  likelihood  ratio  and  perturbation 
analysis  estimators,  we  note  that  both  of  these  estimators, 
when  applied  to  estimation  of  the  gradient,  require  only  a 
single  simulation  at  the  parameter  point  $o-  Of  course,  the 
additional  computer  time  required  to  calculate  the  d  partial 
derivatives  from  the  single  simulation  imply  that  the  com¬ 
putational  effort  to  compute  a  d- dimensional  gradient  is  still 
increasing  in  d.  However,  one  would  expect  that  these  esti¬ 
mation  algorithms  would  be  less  sensitive  to  d  than  are  the 
finite-difference  estimators  of  Section  2.  Thus,  the  likelihood 
ratio  and  perturbation  analysis  estimators  improve  upon  fi¬ 
nite  difference  estimators  in  two  ways:  computation  time  is 
less  sensitive  to  the  dimension  d,  and  the  convergence  rate  is 
n->/3. 
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«.  ORTHOGONAL  FUNCTION 
APPROXIMATIONS 

One  of  the  reaaocs  that  gradient  estimation  play*  a  key 
rote  in  optimisation  is  that  the  gradient  gives  information 
about  the  shape  of  the  objective  function.  When  such  shape 
information  is  added  to  that  supplied  by  a  function  evalua¬ 
tion,  we  are  essentially  being  given  an  affine  approximation 
to  the  function  in  a  neighborhood  of  the  point  at  which  the 
evaluations  occurred.  More  generally,  if  all  the  partial  deriva¬ 
tives  of  an  analytic  function  are  given  at  a  single  (fixed)  point, 
the  entire  global  behavior  of  the  function  is  then  determined. 
The  ability  to  obtain  global  information  about  the  behavior 
of  the  objective  function  is  dearly  useful  in  an  optimisation 
context. 


(The  Fourier  aeries  for  a  is  a  cosine  series  because  of  the 
fact  that  Or  is  an  even  function.)  The  functions  Co,  , . . . 
are  orthogonal  with  respect  to  the  inner  product  (6.2)  that 
we  have  defined,  in  the  s aue  that  (ek ,  C/)  =  0  far  i  j£  f. 
(In  fact,  they  are  orthonormal  since  (ek ,  ek)  =  1  for  k  > 
0.)  Hence,  (6.1)  expresses  a  as  a  linear  combination  of  the 
orthonormal  “vectors"  «o>  <1 1  e2>  •  •  ••  Thus,  we  can  estimate 
a  by  estimating  each  of  the  inner  products  a*=(cr,  tk)  for 
k  >  0.  In  contrast  to  expanding  a  function  in  a  Taylor  series, 
each  of  the  coefficients  at  is  defined  by  an  integral,  namely 


at 


I** 


o(0)d0, 
a(0)eoe  k9d0, 


k-0 
k>  1. 


As  indicated  above,  one  way  to  cheaply  infer  global  be¬ 
havior  is  via  a  Taylor  series  expansion  that  is  determined  by 
the  partial  derivatives  of  the  function.  Another  approach  in¬ 
volves  attempting  to  expand  the  function  in  an  orthogonal 
expansion  of  some  kind.  We  shall  now  illustrate  this  idea  in 
the  case  that  the  decision  parameter  0  is  scalar  vs'ued  and 
the  orthogonal  functions  are  the  trigonometric  functions.  In 
this  case,  we  will  then  obtain  a  Fourier-like  expansion  of  the 
objective  function. 


Monte  Carlo  methods  are  well  suited  to  estimating  integrals. 
In  particular,  suppose  we  generate  U  as  a  uniformly  dis¬ 
tributed  r.v.  on  the  interval  [0,  x]  and  then  simulate  X(U) 
under  the  distribution  Pu  -  Then,  ak  can  be  represented  as 

V2iEX(U),  k  =  0 
2-y* E X(u)  cos  kU,  k  >  1. 


Suppose  that  we  are  interested  in  studying  the  behav¬ 
ior  of  the  objective  function  over  the  interval  [0 ,  x] .  (By 
transforming  the  interval  if  necessary,  this  is  equivalent  to 
studying  a  over  an  arbitrary  compact  interval  of  the  form 
[a,  6].)  We  can  then  make  a  2x-periodic  by  extending  a 
to  [—  X,  0)  via  the  even  extension  Or(0)  —  <*(—  0)  and  then 
letting  a(0  +  2 x)  =  Or(0).  Assuming  a  is  continuously 
differentiable  on  [0,  x],  it  is  well  known  (see  FULKS  (1969), 
p.  547)  that  for  each  0c[—  X,  x], 


(6.1)  a(6)  =  ^2(a,ek)et{6), 

k=0 


where  the  ek't  are  the  normalised  cosine  functions  defined 
by 


«*(*)= 


( 


’ 

^7co®  k0, 


k  =  0 
k  >  1 


and 


Hence,  the  r.v. 


A(m,0)  =  X(U) 


cos(kU)  ■  cos  k6 


has  expectation 


E  A  (m,0)  =  y ^(a,ek)ek(9). 

k—0 


Thus,  A(m,  0)  is  an  unbiased  estimator  for  the  first  m  +  1 
terms  in  the  Fourier  series  of  a.  Note  that  only  one  simula¬ 
tion  is  required  to  estimate  the  first  TO  + 1  Fourier  coefficients 
of  a. 


Suppose  that  we  generate  n  i.i.d.  copies  Ai(m,£), 
. .  . ,  A n(m,  9)  of  the  r.v.  A(m,  9).  We  can  then  form  the 
estimator 


a„(0)  =  -£>(*»„, 0); 

FI 


is  1 


(6.2)  (*,V)=  J  z{9)y{9)d6. 


we  permit  m  =  mn  to  be  a  function  of  the  sample  size  n 
(since  m  will  have  to  grow  with  n  in  order  to  asymptotically 
remove  the  bias  of  the  estimator). 
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In  order  to  measure  the  distance  of  the  estimator  Qn(  ) 
from  the  function  a(-),  we  use  the  norm 


the  form  Xo(Uo)  +  2  J2iT=i  X*(Uk)  coe(6£4)  ccm(kO)), 
this  would  have  no  effect  on  (6.3).  In  other  words,  f||ckn  — 
tt|j2  is  unaffected  by  whether  the  coefficients  are  estimated 
independently  or  not. 

We  are  now  ready  to  state  a  limit  theoron  for  £||an  ~ 


Our  foal  is  to  describe  the  magnitude  of  the  distance  ||an  — 
a\\.  Let 


E"=,  XM), 

TZmlXi(Ui)caB(kUi), 


i  =  0 
k  >  1 


and  note  that 


“n(«)  =  5Za*(n)e*(^)- 

4=0 


If  a  is  continuously  differentiable  on  [0,  x],  then  (6.1)  is  valid, 
so  that 


<*n(0)-a(0)  =  ^(ak(n)-ot)ek(fl)-  ^ 

4=0  4 >m„ 


where  a*  —  (a,  C*).  Then,  the  orthogonality  of  the  e*’s 
guarantees  that 


(6.4)  THEOREM.  Let  6(0)  =  EtX2{0).  Suppose  that 
6(0)  is  continuous  on  [0,  x]  and  a(0)  has  a  continuous  p'th 
derivative  an  [0,  x](p  >  1).  If  m„  =  nl!2r ,  then: 

•)  EX2(lf)  +  2£r=i  EX2(U)ccx2(kU) 

~  m  f0  6(0)d0/x  as  m  — ►  oot 

b)  n1_7£||an— a||2  — *  Ossn  — *  OO  for  any  7  >  1/2 p, 

c)  n1/2-1'/2 ||a„  —  Qt||  ^  0  as  n  — ♦  OO  for  any  7  > 

1/2 P, 

d)  for  e  >  0,  m{0e[O,  x]  :  |a„(0)  -  a(0)|  > 
En'1'/2-1/2}  =>  0  as  n  — *  00,  for  7  >  1/2 p.  (m 
is  Lebesgue  measure.) 


Thus,  if  a  is  sufficiently  smooth,  we  can  obtain  a  global 
convergence  rate  arbitrarily  dose  to  n-1^2  in  the  num¬ 
ber  of  observations  n  that  are  simulated.  (We  note  that 
because  a  is  2x-periodic  and  is  an  even  function,  a’s  first 
p  derivatives  must  vanish  at  0  and  X  in  order  to  satisfy  the 
smoothness  hypothesis  of  Theorem  6.4.  If  a  does  not  sat¬ 
isfy  the  condition,  one  can  shrink  a't  domain  of  definition  to 
[e,  X  —  c]  and  then  smoothly  extend  a  to  [0,  x]  in  order  to 
satisfy  the  smoothness  hypothesis.) 

T.  STOCHASTIC  APPROXIMATION 
ALGORITHMS 


||an -a||2  =  (at(n)  -  a*)2  +  a\. 

4=0  4>m, 


In  this  section,  we  briefly  describe  how  the  results  of 
the  previous  sections  can  be  integrated  into  an  optimization 
algorithms. 


Hence, 

(6-3) 

£|lar.  -all2  =  53 *ir®*(n)+  a* 

ksO  k>mm 

/  m. 

=  2xn-1  EX2(U)  +  2  £  EX2(U)  cos2  kU 

V  4=1 

-n_1£a4+  53  °*' 

4=0  4  >m„ 


Consider  the  unconstrained  problem  in  which  the  goal 
is  to  minimize  the  objective  function  a(0)  over  0E0T1.  The 
idea  is  to  develop  a  recursive  procedure  in  which  the  (n  + 
l)'st  iterate  is  likely  to  be  closer  to  the  minimiter  0*  than 
is  the  n’th  iterate.  Specifically,  suppose  that  it  is  passible 
to  generate  r.v.'s  W(0)  such  that  EW(0)  SS  Va(0);  as 
discussed  in  Sections  2-5,  this  can  be  done  either  through  a 
finite- difference  approximation  or  through  the  likelihood  ra¬ 
tio  and  perturbation  analysis  gradient  estimators  (in  which 
case  EW(9)  is  typically  equal  to  Va(0)).  Assuming  exis¬ 
tence  of  such  r.v.’s  14^ (0),  consider  the  recursion 


It  is  worth  observing  that  if  each  of  the  Fourier  coefficients 

were  to  be  estimated  independently  (so  that  A(m,  0)  takes  (7-1)  0n+l  —  &n  n  rVf,-fi, 
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where  r  is  a  given  d  X  d  matrix  and  P{Vn+i£A\6o,  Vo, 
=  P{W(fin)cA}.  Is  other  words,  the  r.v. 
K»+i  is  generated  by  »i mutating  a  copy  of  W(9„). 

In  the  case  that  E{Vn+i\0o,Vo, .  . .  ,9n,V„}  = 
Va(9n)  (a*  would  occur  if  the  methods  of  Sections  3  and  4 
were  used),  algorithm  (7.1)  is  known  as  the  Robbins-Monro 
algorithm.  Assuming  that  a  is  twice  continuously  differen¬ 
tiable,  the  optimal  choice  of  the  matrix  T  then  turns  out  to 
be 

r  =  H(0')-\ 

where  H  is  the  Hessian  of  second  derivatives;  see  POLYAK 
and  TSYPKIN  (1080)  for  details. 

It  is  particularly  illuminating  to  consider  (7.1)  in  the 
case  that  9  is  scalar-valued.  In  this  case,  (7.1)  takes  the  form 


(7.2)  0n+l  =  On  —  n  1cVrn+1. 

Note  that  if  C  >  0,  6„  +  j  has  a  tendency  to  be  smaller 
than  0n  when  q'(6„)  >  0,  and  has  a  tendency  to  increase 
when  a'(9„)  <  0.  As  a  consequence,  the  sequence  (6„  : 
n  >  1)  has  a  tendency  to  move  towards  a  point  9‘  for 
which  a'(0*)  =  0  and  a' (9)  >  0  (a'(tf)  <  0)  for  9  in 

a  neighborhood  to  the  right  (left)  of  9 * .  Any  such  9 *  must 
necessarily  be  a  local  mini  miser  of  a.  Thus,  the  algorithm 
(7.1)  appears  intuitively  reasonable. 

In  fact,  (7.1)  has  good  convergence  characteristics.  If 
£{K>+l|0O.  Vo,...,9nt  K»}  =  a'{0n)  and  d=  1,  RUP- 
PERT  (1982)  has  shown  that  under  suitable  regularity  con¬ 
ditions, 


(7.3)  n1/2f  ,/2(tfLBlj  -  0‘ )  =>  at- *S(f2°+l) 

as  n  — *  OO  (in  the  Skorohod  space  D[e,  oo),  £  >  0),  where 
B()  is  standard  Brownian  motion  and  a  and  d  are  certain 
problem-dependent  constants.  Setting  t  =  1  in  (7.3),  we 
conclude  that  0n  converges  to  0 *  at  rate  n-1^2  when  unbi¬ 
ased  estimators  of  the  gradient  are  available. 

On  the  other  hand,  in  certain  applications,  only  finite- 
difference  approximations  to  the  gradient  may  be  present. 
For  example,  suppose  d  =  1  and  that  P{Vn+\CA\9<}y  Vo, 

•  ■  -A.  Vn)  =  P{[X(9n  +  «*»/•)  -  X(9n  -  cn-1'6)] 
/2 cn'lt*eA),  where  X(0n  +  Cri~I/<)  is  simulated  (un¬ 
der  independently  of  X(9„  —  cn~^e)  (under 


P Here  a  central  difference  approximation  to 
the  derivative  is  being  used  (recall  that  is  the  opti¬ 

mal  difference  increment  as  specified  by  Theorem  2.3).  The 
resulting  minimisation  algorithm  is  known  as  the  Kiefer- 
Wolfowits  procedure.  As  one  might  expect,  some  degra¬ 
dation  in  the  convergence  rate  occurs  as  a  consequence  of 
the  finite-difference  approximation.  Specifically,  RUPPERT 
(1982)  shows  that  under  suitable  regularity  hypotheses, 

(7.4)  n1'3!1'3^  -0*)  =>  b1+b2rA~iB(t7A+1) 

as  n  — *  OO  (in  D[c,  oo)),  where  6j ,  tj ,  and  A  are  problem- 
dependent  constants.  Thus,  the  convergence  rate  of  the 
Kiefer- Wolfowits  procedures  in  which  a  central  difference  ap¬ 
proximation  is  used  to  estimate  the  gradient,  is  n-1/3  in 
the  number  of  observations  generated.  Note  that  this  con¬ 
vergence  rate  is  identical  to  that  discovered  in  Theorem  2.3. 
The  fact  that  the  convergence  rates  for  the  optimisation  al¬ 
gorithms  (7.3)  and  (7.4)  match  the  convergence  rates  of  the 
corresponding  gradient  estimators  indicate-)  the  pivotal  role 
that  gradient  estimation  plays  in  the  optimisation  setting. 

While  the  above  discussion  has  focused  on  uncon¬ 
strained  optimisation,  constrained  variants  of  (7.1 ) — (7.2)  are 
also  available.  Among  the  approaches  that  have  been  stud¬ 
ied  are  penalty  function  methods  and  Lagrange  multiplier 
techniques;  see  RUBINSTEIN  (1986)  for  a  more  extensive 
description. 

A  somewhat  different  philosophy  for  optimising  sto¬ 
chastic  systems  via  simulation  involves  the  idea  of  using  sim¬ 
ulation  to  develop  a  description  of  the  global  behavior  of 
the  objective  function  and  constraints.  The  orthogonal  func¬ 
tion  approximations  of  Section  6  would  represent  one  way  to 
obtain  such  global  descriptions.  Having  fitted  functional  ap¬ 
proximations  to  the  objective  function  and  constraints,  one 
can  then  use  deterministic  techniques  to  optimize  the  fitted 
surface.  One  then  uses  the  optimizer  of  the  fitted  surface  as 
an  approximation  to  the  optimizer  of  the  original  stochastic 
system. 
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APPENDIX 

Proof  of  Theorem  2.2.  Let  Xi(0)  —  Xf(0)  —  o(0)  be 
the  centered  version  of  Xi(0).  We  first  with  to  show  that 
when  h„  — *  0, 
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•an—*  00.  The  left-hand  aide  of  (A.4)  can  be  bounded  by 


n-i/3  (£*<(flo  +  M-W) 

(A.i)  \i=l  / 

=>  V2<t($0)N(Q,  1) 

aa  n  — *  OO.  Thia  followa  from  the  central  limit  theorem 
for  triangular  array*  (aee,  for  example,  CHUNG  (1974),  pp. 
205-209).  In  particular,  to  verify  Lindeberg'a  condition,  we 
obeerve  that  (2.1)  ii)  and  iii)  together  imply  that  Xj(0o  + 

An)  =>  Xi(9, 0)  and  EXi(0O  +  An)2  -  EXi(0 o)J,  from 

which  it  follow a  that  {A^(0o  +  An)2  :  n  >  f»o }  ia  uniformly 
integrable  (aee  Theorem  4.5.4,  p.  97,  CHUNG  (1974)).  The 
uniform  integrability  of  {X,(0o  +  An)2  :  n  >  no}  im- 
pliea  that  of  {(A,-(0o  +  A„)  —  Xi(^o))2  :  n  >  no}i  from 
which  Lindeberg’a  condition  ia  an  eaay  consequence.  Thia 
established  (A.l). 

We  now  note  that 


|2+« 


<  El/SY^hnW+'Kal+'K'/'n''3) 

P  /-L  \  — */2 


(nAn)‘ 


aa  n  — ►  OO;  thia  yields  Lindeberg’a  condition  (aince  nh„  —* 
OO).  The  proof  ia  completed  by  writing 


nl/3(Aots(n,  A„)  -  a'(tfo)) 

=  -7-1  ==  E  Ay^An)/^  -1-  n  V3b(hn), 
yhnTl  ' 

where  6(A)  =  EAYi(h)/h  —  q'(8q).  (To  obtain  the  de¬ 
sired  limit  theorem,  use  (A.3)  and  6(An)  =  a"($o)hn/2  + 

o(hn).) 


n1/4(Ao!(n,An)  -  a'(0o)) 

—  1/2  /  w 

(A.2)  =^f£*.(*o  +  An)-*i(0o) 

\»=1 

+  n1/46(A„) 


Proof  of  Theorem  2.13.  We  lint  observe  that  there 
exists  £  >  0  such  that  {hZi(h)  :  0  <  h  <  c}  is  uniformly 
integrable.  (Use  the  same  argument  as  that  employed  in  the 
proof  of  Theorem  2.2).  Hence,  Lindeberg’a  condition  may  be 
verified,  to  obtain  the  central  limit  theorem 


where  6(6)  =  (ar(0o  +  A)  —  a(8o))/h  —  o/(0o)-  It  i» 
evident  that  by  (2.1)  iv),  a(0o  +  A)  =  a(#o)  +  ha'($o)  + 
h2a"(0 o)/2  +  o(62),  so  that  6(6)  =  ha"(60)/2  +  o(6). 
The  theorem  then  follows  immediately  from  (A.l)  and  (A.2). 

Proof  of  Theorem  2.5.  The  proof  proceed*  along  the 
tame  basic  lines  as  in  Theorem  2.2.  We  first  note  that  (2.4) 
ii)  and  iv)  together  imply  that  <r2=varAy  (An)  =  6n<72  + 
°(Ar>)  -  ( a'(0o)A„  +  0(An))2  =  6„<r2  +  o(6„).  Let 
AVj(A)  =  A Yi(h)  —  EAYi(h).  We  wish  to  show  that  if 
6„  — *  0  with  nhn  — ►  OO,  then 


(A.3) 


Yi(hn)=><xN  (0,1) 

•  si 


a*  n  — *  OO.  To  obtain  (A3),  we  need  to  verify  Lindeberg'a 
condition  for  the  triangular  array  { AVi(An)/ \/nhn  :  1  < 
t  <  n,  n  >  1}.  But  the  Lindeberg  condition  reduces  here 
to  verifying  that  for  K  >  0, 


(A.4)  EiAY^I^AY^K)2  >  Ko\  n)  -  0 


(A.5)  IT1'2  £  An^A,,)^  <73^(0,!) 

i=i 

aa  H  — ►  OO.  The  next  step  ia  to  study  the  bias  term  0(h)  = 
EZi(h)  —  o'(8q).  We  start  by  observing  that  aince  Q  ia  m 
times  continuously  differentiable,  it  followa  that 


(A.6) 


k 

^a(0(o0)^l  +  o(Am)j 


1 

A 


r=0  '  *sl 2=0  '  / 


(~1ib)1~<r-MKAm-1) 

=  if;o(r)(Oo)Ar7r  +  o(Am-,), 
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where 


N  =  N(e)  such  that  |6(0)  —  <  £,  what  6^(-) 

the  piecewiae  constant  function  defined  by 


To  get  a  handle  on  the  yr  we  note  that 

X  =  log(l  +  e*  -  1) 

= £  — - 1)* + oii"*') 

=b-‘>‘+,£(*)<->>‘-4+<>< 

k=l t=0  rsO 

r=0  ’  k si  t=0  V  / 


xm+1) 


xm+l) 


=  J/Tr  +  0(tm+1). 


-»m) 


Comparing  coefficenta  in  X,  we  conclude  that  yr  —  0  for 
r  ^  1  and  71  =  1.  Substituting  in  (A.6),  we  conclude 
that  EZi(h)  =  a^(e0)  +  o(hm~l).  Hence.  0(h)  = 
o(hm~1).  To  obtain  the  desired  central  limit  theorem,  we 


n!%~’  (as(hn,n)  —  a'(0o)) 

=  --«*t  *•«*•) 


and  use  (A.5)  and  the  estimate  /?(hn)  = 

Proof  of  Theorem  6.4.  To  prove  a),  we  will  show  that 

EX3(U)  co *3(kU)  -  2-1  /'  6{0)d0/x  -  k  -  oo.  We 

start  by  observing  that 


(A.7)  £X2(l/)cos2(fc[/)  =  /  6(0)cos2(M)d0/i 

Jo 


Since  6(  )  is  continuous,  it  it  evident  that  6(-)  is  uniformly 
continuous  on  [0,*].  Hence,  for  every  C  >  0,  there  exists 


Since  COS2(Jc0)  <  1,  it  follows  that 


'  w  w 

| £  bN(0)  co*3(k$)d$  -  jf  b(6 )  co*3 (k0)d6 

(A.9)  |  j\N(0)d0-  j\(0)dB  <£. 


Choose  k  so  large  that  2 xN/k  <  £.  Then,  setting  / 
t*/2*J  ,  we  have 


bN(0)  co*3  (k0)d0 


r'U+D/N 


f  bN(0)co*3(k0)d0 

Jo  ■ 

"Z*  fwU+l)/N 

=  E  /  M#)cos2(M)d0 

J*i/N 

*Z}  /•'U+D/N 

=  E  /  coe2(W)^ 

fr'o  J*HN 

N- 1  ,  ,w  kj/N+wk/S 

=  E  bN(*i/N)l  /  cos2(u)dt» 

N-l  i  r  rfkj/N+2rl 

=  E  /  cos2(u)du 

JTj  *l  •'**>/* 

+  y  cos2  f  v  +  +  2t/J  drj 

=  E*"W/W)[T+j/ 

;=o  0 

cos2  +  2x/^  dtj 

Af-l  ,  fwk(J+l)/S 

=  E  /  *>/2 

f?o  k  J'i’lN 

N-l  .  ,w(*/W-«) 

+  E  bN^m-j;  Jo 


,w(*/N-«) 


rwk(j+lV» 


rw(*/AT-«) 


i.e., 


=  \f\N(0)d6 

N-l  ,  Mk/N-U) 

(A.10)  +  5^  bN(xj/N)~  / 

j=0  * J° 

[“*’  (V  +  TT  +  2,/)  -  5] dv 

(We’ve  used  the  feet  that  C062(ti)dtl  =  //+S#< 

sin2(u)rfu,  so  that  ft**** 

C062(ti  )du  =  2-1  //+2'/[cm2(u)  +  sin2(u)]du  =  xt.) 
But  | k/N  -2t\<2,»o 


and  thus 


*=m+l 

Hence,  if  =  n$p,  it  follows  that  ak 

—  0(n^»_l),  Furthermore,  part  a)  implies  that 


£X2(u)  +  2  £  EX3(u)  coe2(iti) 


1st 


=  0(n*“l); 


b)  then  fellow*  immediately  from  these  estimates.  Result  c) 
is  a  well-known  consequence  of  b).  For  d),  we  note  that 


n^lK-all2 

=  nl~y  f  ( an(6)-a(6))2d0 

>  em^0t[~x,x]  :  |an(0)  -  a(0)|  >  . 


(A.ll) 


W-l  1  r(k/N-2t ) 

/ 

iZ 0  k 

^cos2  +  Ixl^j  -  ij  dv|  <  Me 


where  M  =  max{|B(#)|  :  6e[0,  w] } .  Combining  (A.8)- 
(A.ll),  we  obtain 


/  6(0)co82(W)d<?-2-1  [  b(0)d0 
Jo  Jo 


<(M  +  2)c. 


Since  t  is  arbitrary,  we  obtain  our  desired  conclusion.  Turn¬ 
ing  now  to  b),  we  note  that  the  smoothness  of  Or  implies  that 
(see  FULKS  (1969),  p.  5S1) 


sup 

0<*<r 


4=0 


0(m^-p) 


so  that 


II® 


m 


=  0(m1~ip) 

4=0 


Part  b)  implies  that  Em{0e[—x,x]  :  |a„(0)  —  ct(0)| 
>  — ►  0  a»  n  — *  OO,  yielding  d). 
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